Global Cancer Patients 2015 - 2024 -> Machine Learning Project

Exploratory Data Analysis and Supervised Machine Learning

Introduction

This project uses the "global_cancer_patients_2015_2024" dataset from Kaggle, available at: "https://www.kaggle.com/datasets/zahidmughal2343/global-cancer-patients-2015-2024/data"

This project will attempt to predict the severity of cancer from the various factors in the dataset, such as age, gender, and obesity.

It will fit K-Nearest Neighbours and Random Forest regressors and compare the two methods.

In [49]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For model building and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('data/global_cancer_patients_2015_2024.csv')

# Display basic information about the dataset
df.info()

# Display the first few rows of the dataset
df.head()

df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Patient_ID             50000 non-null  object 
 1   Age                    50000 non-null  int64  
 2   Gender                 50000 non-null  object 
 3   Country_Region         50000 non-null  object 
 4   Year                   50000 non-null  int64  
 5   Genetic_Risk           50000 non-null  float64
 6   Air_Pollution          50000 non-null  float64
 7   Alcohol_Use            50000 non-null  float64
 8   Smoking                50000 non-null  float64
 9   Obesity_Level          50000 non-null  float64
 10  Cancer_Type            50000 non-null  object 
 11  Cancer_Stage           50000 non-null  object 
 12  Treatment_Cost_USD     50000 non-null  float64
 13  Survival_Years         50000 non-null  float64
 14  Target_Severity_Score  50000 non-null  float64
dtypes: float64(8), int64(2), object(5)
memory usage: 5.7+ MB
Out[49]:
Age Year Genetic_Risk Air_Pollution Alcohol_Use Smoking Obesity_Level Treatment_Cost_USD Survival_Years Target_Severity_Score
count 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 54.421540 2019.480520 5.001698 5.010126 5.010880 4.989826 4.991176 52467.298239 5.006462 4.951207
std 20.224451 2.871485 2.885773 2.888399 2.888769 2.881579 2.894504 27363.229379 2.883335 1.199677
min 20.000000 2015.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5000.050000 0.000000 0.900000
25% 37.000000 2017.000000 2.500000 2.500000 2.500000 2.500000 2.500000 28686.225000 2.500000 4.120000
50% 54.000000 2019.000000 5.000000 5.000000 5.000000 5.000000 5.000000 52474.310000 5.000000 4.950000
75% 72.000000 2022.000000 7.500000 7.500000 7.500000 7.500000 7.500000 76232.720000 7.500000 5.780000
max 89.000000 2024.000000 10.000000 10.000000 10.000000 10.000000 10.000000 99999.840000 10.000000 9.160000

Exploratory Data Analysis

In [3]:
# Can we track a patient through the given years?
df.groupby('Patient_ID')['Gender'].count()

# What's in the different columns?
# for c in df.columns[1:]:
#     print(c, df[c].unique())

fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df.drop(columns=['Patient_ID', 'Year']).corr(numeric_only=True),
            robust=True, annot=True, fmt=".1f", ax=ax)

sns.pairplot(df, diag_kind='kde')
Out[3]:
<seaborn.axisgrid.PairGrid at 0x7264bb4f2b90>
In [16]:
# sns.displot(
#     df, x="Target_Severity_Score", col="Cancer_Type", row="sex",
#     binwidth=3, height=3, facet_kws=dict(margin_titles=True),
# )

g = sns.FacetGrid(df, col="Gender", height=5)
g.map_dataframe(sns.violinplot, y="Target_Severity_Score", x="Cancer_Type")

for ax in g.axes.flat:
    for label in ax.get_xticklabels():
        label.set_rotation(45)

Comments on the Data

The data appears to have been machine generated. There are no correlations between anything except the target severity score and its predictor variables. Females don't have prostates, so it would be difficult for so many women to have prostate cancer in such an even distribution. The scatterplots between the different variables likewise show pure randomness. There are no cancer-free patients in this dataset, so the objective is to predict the severity of the cancer rather than its presence.
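One quick way to surface implausible combinations like female prostate cancer is a cross-tabulation of Gender against Cancer_Type. A minimal sketch on a few toy rows (the column names match the dataset, but the data here is illustrative only):

```python
import pandas as pd

# Toy rows mimicking the dataset's Gender / Cancer_Type columns
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Cancer_Type": ["Prostate", "Prostate", "Breast", "Lung", "Prostate"],
})

# Rows per (Gender, Cancer_Type) pair; on the real data this would show
# whether biologically implausible pairs are evenly populated
counts = pd.crosstab(toy["Gender"], toy["Cancer_Type"])
print(counts)
```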

Breaking the data into train and test sets

In [40]:
df2 = df.drop(columns=['Patient_ID', 'Year', 'Country_Region', 'Survival_Years',
                       'Treatment_Cost_USD', 'Gender', 'Cancer_Type'])

# Encode the ordinal cancer stage as an integer (unlisted stages stay 0)
df2.insert(1, "Cancer_Stage_no", 0)
df2.loc[df2.Cancer_Stage == "Stage I", 'Cancer_Stage_no'] = 1
df2.loc[df2.Cancer_Stage == "Stage II", 'Cancer_Stage_no'] = 2
df2.loc[df2.Cancer_Stage == "Stage III", 'Cancer_Stage_no'] = 3
df2.loc[df2.Cancer_Stage == "Stage IV", 'Cancer_Stage_no'] = 4

df2 = df2.drop(columns=['Cancer_Stage'])

# Pass y as a Series to avoid shape warnings from the regressors
X_train, X_test, y_train, y_test = train_test_split(
    df2.drop(columns=['Target_Severity_Score']),
    df2['Target_Severity_Score'],
    test_size=0.2,
)

X_train.head()
Out[40]:
Age Cancer_Stage_no Genetic_Risk Air_Pollution Alcohol_Use Smoking Obesity_Level
19625 69 3 3.5 1.7 5.6 9.5 6.8
20301 48 1 9.8 5.6 3.4 2.1 10.0
40286 46 2 2.7 7.9 7.7 3.2 9.0
35131 46 2 6.8 10.0 2.5 7.6 4.6
194 42 4 6.6 0.6 7.8 8.1 1.5
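The four `.loc` assignments above can also be collapsed into a single dictionary `map`. A sketch on a toy series (stage labels as in the dataset; unmapped labels fall back to 0, matching the default used in the cell above):

```python
import pandas as pd

# Ordinal encoding of the cancer stage labels
stage_map = {"Stage I": 1, "Stage II": 2, "Stage III": 3, "Stage IV": 4}

stages = pd.Series(["Stage II", "Stage IV", "Stage 0", "Stage I"])

# Unmapped labels (e.g. "Stage 0") become NaN, so fill with 0
stage_no = stages.map(stage_map).fillna(0).astype(int)
print(stage_no.tolist())  # [2, 4, 0, 1]
```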

KNN - regression

Unlike the work done in our homework, this is a regression problem.
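Before fitting, it helps to recall what KNeighborsRegressor actually computes: the prediction for a query point is the (uniformly weighted) mean of the targets of its k nearest training points. A pure-Python sketch for the one-dimensional case:

```python
# Minimal 1-D KNN regression: predict the mean target of the k nearest
# training points (sklearn's KNeighborsRegressor does the same with
# uniform weights)
def knn_predict(x_train, y_train, x_query, k):
    # Sort training points by distance to the query point
    by_dist = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - x_query))
    nearest = by_dist[:k]
    return sum(y for _, y in nearest) / k

x_train = [1.0, 2.0, 3.0, 10.0]
y_train = [1.0, 2.0, 3.0, 10.0]

print(knn_predict(x_train, y_train, 2.1, k=2))  # mean of the targets at x=2 and x=3
```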

In [45]:
pot_neigh = [1,2,3,4,5,10,20,50]

r2_scores = []
r2_train_scores = []
MSE_scores = []
MSE_train_scores = []

for i in pot_neigh:
    neighbourino = KNeighborsRegressor(n_neighbors=i)
    neighbourino.fit(X_train, y_train)
    Y_pred = neighbourino.predict(X_test)
    Y_train_pred = neighbourino.predict(X_train)
    r2_scores.append(r2_score(y_test, Y_pred))
    r2_train_scores.append(r2_score(y_train, Y_train_pred))
    MSE_scores.append(mean_squared_error(y_test, Y_pred))
    MSE_train_scores.append(mean_squared_error(y_train, Y_train_pred))
In [47]:
plt.plot(pot_neigh, r2_scores, label="Test Data", color = 'blue')
plt.plot(pot_neigh, r2_train_scores, label="Train Data", color = 'red')

# Add labels and title
plt.xlabel('Number of Neighbours')
plt.ylabel('r2')
plt.title('Impact of Number of Neighbours on R2')
plt.legend(loc="upper right")
# Show plot
plt.show()


plt.plot(pot_neigh, MSE_scores, label="Test Data", color = 'blue')
plt.plot(pot_neigh, MSE_train_scores, label="Train Data", color = 'red')

# Add labels and title
plt.xlabel('Number of Neighbours')
plt.ylabel('MSE')
plt.title('Impact of Number of Neighbours on MSE')
plt.legend(loc="upper right")
# Show plot
plt.show()

KNN Conclusion

Although the train-data MSE and R-squared values both preferred a model with fewer neighbours, that increased the variance of the model and resulted in poorer out-of-sample predictions. This is a clear example of where accepting some bias in exchange for a more robust model is appropriate.

Random Forest Approach

Random Forests are an ensemble of individual decision trees: each tree is fit to a bootstrap sample of the data, and the trees' predictions are averaged.
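The averaging mechanism (bootstrap aggregation, or "bagging") can be sketched in a few lines of pure Python. As an illustration only, each "model" here is simply the mean of its bootstrap sample, standing in for a fitted decision tree:

```python
import random

# Sketch of bagging: fit many models on resampled data, then average.
def bagged_mean(data, n_estimators, seed=0):
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_estimators):
        # Resample with replacement (a bootstrap sample)
        sample = [rng.choice(data) for _ in data]
        estimates.append(sum(sample) / len(sample))
    # Average the individual models' predictions
    return sum(estimates) / n_estimators

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(bagged_mean(data, n_estimators=200), 2))  # close to the sample mean 3.0
```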

In [51]:
RF = RandomForestRegressor()

depthy = [1, 3, 5, 8, 10]
nesty = [10, 50, 100, 200]
parammers = {'max_depth': depthy, 'n_estimators': nesty}
grid_search = GridSearchCV(estimator=RF, param_grid=parammers, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
Out[51]:
GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [1, 3, 5, 8, 10],
                         'n_estimators': [10, 50, 100, 200]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='r2', verbose=0)
In [52]:
results = pd.DataFrame(grid_search.cv_results_)

pivot_table = results.pivot_table(
    values='mean_test_score', 
    index='param_max_depth', 
    columns='param_n_estimators'
)

plt.figure(figsize=(8,6))
sns.heatmap(pivot_table, annot=True, fmt=".3f", cmap='viridis')
plt.title('Grid Search CV Results')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.show()
In [63]:
best_model = grid_search.best_estimator_
Y_RF_pred = best_model.predict(X_test)
RF_R2 = r2_score(y_test, Y_RF_pred)
RF_MSE = mean_squared_error(y_test, Y_RF_pred)
print(best_model.get_params())
print("Random forest Best Model R Squared: {:.3f}".format(RF_R2))
print("Random forest Best Model Mean Squared Error: {:.3f}".format(RF_MSE))

# print(pd.DataFrame(grid_search.cv_results_))
RF2 = RandomForestRegressor(n_estimators=50, max_depth=8)
alt_model = RF2.fit(X_train, y_train)

Y_RF_pred_alt = alt_model.predict(X_test)
RF_R2_alt = r2_score(y_test, Y_RF_pred_alt)
RF_MSE_alt = mean_squared_error(y_test, Y_RF_pred_alt)

print("Random forest Alt Model R Squared: {:.3f}".format(RF_R2_alt))
print("Random forest Alt Model Mean Squared Error: {:.3f}".format(RF_MSE_alt))
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 10, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
Random forest Best Model R Squared: 0.774
Random forest Best Model Mean Squared Error: 0.327
Random forest Alt Model R Squared: 0.750
Random forest Alt Model Mean Squared Error: 0.362

Random Forest Conclusion

The random forest gives similar results to the KNN approach, although it requires noticeably more computation. Unlike the simplified hold-out comparison I applied when tuning KNN, the grid-search winner here was also the best out-of-sample model. This is likely because the cv=5 cross-validation inside GridSearchCV already accounts for out-of-sample performance when selecting the model.
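To see why, cv=5 partitions the training rows into five folds; each candidate (max_depth, n_estimators) pair is then trained on four folds and scored on the held-out fifth, rotating through all folds. A sketch of the fold construction, mirroring scikit-learn's unshuffled KFold index logic:

```python
# Partition row indices 0..n_rows-1 into k contiguous folds; the first
# n_rows % k folds get one extra row, as in sklearn's KFold
def kfold_indices(n_rows, k):
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

print(kfold_indices(10, 5))  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Each fold serves once as the validation set, so the mean test score already reflects performance on unseen rows.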